Sarah A. Thomas

Project 6 - Thera Bank

Description (copied from assignment)

Background & Context

Thera Bank recently saw a steep decline in the number of users of its credit cards. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.

Customers leaving the credit card service would lead to a loss for the bank, so the bank wants to analyze customer data to identify the customers who are likely to leave its credit card services, and the reasons why, so that the bank can improve in those areas.

As a Data Scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.

You need to identify the best possible model that will give the required performance

Objective

  1. Explore and visualize the dataset.
  2. Build a classification model to predict whether a customer is going to churn.
  3. Optimize the model using appropriate techniques.
  4. Generate a set of insights and recommendations that will help the bank.

Data Dictionary:

1 - Load Packages and Read in the Dataset
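A minimal sketch of this step. The file name `BankChurners.csv` and the sample values below are assumptions; here a tiny inline CSV stands in for the real file so the snippet is self-contained:

```python
import io
import pandas as pd

# In the notebook the dataset would be read from disk, e.g.:
# df = pd.read_csv("BankChurners.csv")   # hypothetical file name
# A tiny inline sample (made-up values) stands in for the file here.
csv_text = """CLIENTNUM,Attrition_Flag,Customer_Age,Gender,Credit_Limit
768805383,Existing Customer,45,M,12691.0
818770008,Attrited Customer,49,F,8256.0
"""
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 5)
```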

2 - Data Preprocessing

2.1 - Check the first 10 rows, the last 10 rows, and a random sample of 10 rows of the dataset

2.2 - Check the shape of the data

2.3 - Check the datatypes and row counts for each column
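The checks in 2.1-2.3 can be sketched as follows (a toy frame with made-up values stands in for the Thera Bank data):

```python
import pandas as pd

# Toy stand-in for the Thera Bank data (values are made up)
df = pd.DataFrame({
    "CLIENTNUM": [1, 2, 3],
    "Customer_Age": [45, 49, 51],
    "Gender": ["M", "F", "M"],
})

print(df.head(10))                      # first rows (up to 10)
print(df.tail(10))                      # last rows
print(df.sample(n=2, random_state=1))   # random rows

print(df.shape)    # (rows, columns)
print(df.dtypes)   # datatype of each column
print(df.count())  # non-null row count per column
```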

Observations:

2.4 - Check for duplicates and drop CLIENTNUM
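A sketch of this step on toy data; `CLIENTNUM` is a unique customer identifier, so it carries no predictive signal and can be dropped:

```python
import pandas as pd

# Toy frame with one fully duplicated row (made-up values)
df = pd.DataFrame({
    "CLIENTNUM": [1, 2, 2],
    "Customer_Age": [45, 49, 49],
})

n_dups = df.duplicated().sum()       # count fully duplicated rows
df = df.drop_duplicates()
# CLIENTNUM is an identifier with no predictive value, so drop it
df = df.drop(columns=["CLIENTNUM"])
print(n_dups, df.shape)
```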

2.5 - Convert Categoricals
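One way to do this conversion (toy columns for illustration) is to cast the object columns to pandas' memory-efficient `category` dtype:

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["M", "F", "M"],
    "Card_Category": ["Blue", "Silver", "Blue"],
})

# Convert object columns to the memory-efficient category dtype
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].astype("category")

print(df.dtypes)
```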

3 - EDA

3.1 - Explore Summary of Data

Observations:

3.2 - Univariate Analysis

3.2.1 - Customer_Age

3.2.2 - Dependent_count

3.2.3 - Months_on_book

3.2.4 - Total_Relationship_Count

3.2.5 - Months_Inactive_12_mon

3.2.6 - Contacts_Count_12_mon

3.2.7 - Credit_Limit

3.2.8 - Total_Revolving_Bal

3.2.9 - Avg_Open_To_Buy

3.2.10 - Total_Trans_Amt

3.2.11 - Total_Trans_Ct

3.2.12 - Total_Ct_Chng_Q4_Q1

3.2.13 - Total_Amt_Chng_Q4_Q1

3.2.14 - Avg_Utilization_Ratio

3.2.15 - Attrition_Flag

3.2.16 - Gender

3.2.17 - Education_Level

3.2.18 - Marital_Status

3.2.19 - Income_Category

3.2.20 - Card_Category

3.3 - Bivariate Analysis

3.3.1 - Correlation

The following are highly correlated:

The following are moderately correlated:

The following are slightly correlated:
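The correlation groupings above come from the pairwise correlation matrix, which can be computed as sketched below (toy numbers; `Credit_Limit` and `Avg_Open_To_Buy` are made nearly collinear here to mimic a "highly correlated" pair):

```python
import pandas as pd

# Toy numeric columns (made-up values); the notebook uses the real features
df = pd.DataFrame({
    "Credit_Limit":    [1000.0, 5000.0, 9000.0, 12000.0],
    "Avg_Open_To_Buy": [ 900.0, 4800.0, 8700.0, 11900.0],
    "Customer_Age":    [  26.0,   54.0,   33.0,    48.0],
})

corr = df.corr()  # pairwise Pearson correlations
# For a heatmap view (if seaborn is available): sns.heatmap(corr, annot=True)
print(corr.round(2))
```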

3.3.2 - Customer_Age vs Attrition_Flag

3.3.3 - Gender vs Attrition_Flag

3.3.4 - Dependent_count vs Attrition_Flag

3.3.5 - Education_Level vs Attrition_Flag

3.3.6 - Marital_Status vs Attrition_Flag

3.3.7 - Income_Category vs Attrition_Flag

3.3.8 - Card_Category vs Attrition_Flag

3.3.9 - Months_on_book vs Attrition_Flag

3.3.10 - Total_Relationship_Count vs Attrition_Flag

3.3.11 - Months_Inactive_12_mon vs Attrition_Flag

3.3.12 - Contacts_Count_12_mon vs Attrition_Flag

3.3.13 - Credit_Limit vs Attrition_Flag

3.3.14 - Total_Revolving_Bal vs Attrition_Flag

3.3.15 - Avg_Open_To_Buy vs Attrition_Flag

3.3.16 - Total_Trans_Amt vs Attrition_Flag

3.3.17 - Total_Trans_Ct vs Attrition_Flag

3.3.18 - Total_Ct_Chng_Q4_Q1 vs Attrition_Flag

3.3.19 - Total_Amt_Chng_Q4_Q1 vs Attrition_Flag

3.3.20 - Avg_Utilization_Ratio vs Attrition_Flag

4 - Model Building - Data Preparation

4.1 - Split the Data
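A sketch of the split on toy data. Stratifying on the target keeps the churn ratio the same in the train and test sets, which matters for an imbalanced target like `Attrition_Flag`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in: 100 rows, 20% churners (the real data is similarly imbalanced)
X = pd.DataFrame({"Customer_Age": range(100)})
y = pd.Series([1] * 20 + [0] * 80)  # 1 = attrited, 0 = existing

# stratify=y keeps the same churn ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)  # (70, 1) (30, 1)
```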

4.2 - Missing Value Treatment
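A minimal sketch, assuming mode imputation for a categorical column (the column and values are made up). The key point is that the fill value is learned from the train split only, to avoid leaking test information:

```python
import pandas as pd

# Toy train/test columns with missing values (made-up data)
train = pd.DataFrame({"Education_Level": ["Graduate", None, "Graduate", "Doctorate"]})
test = pd.DataFrame({"Education_Level": [None, "Graduate"]})

# Impute with the mode learned from the TRAIN split only, to avoid leakage
mode = train["Education_Level"].mode()[0]
train["Education_Level"] = train["Education_Level"].fillna(mode)
test["Education_Level"] = test["Education_Level"].fillna(mode)
print(train["Education_Level"].tolist())
```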

4.3 - Dummy Variables
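One common way to create the dummies (toy columns for illustration); `drop_first=True` drops the redundant reference level of each categorical so the dummy columns are not perfectly collinear:

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["M", "F", "M"],
    "Marital_Status": ["Married", "Single", "Divorced"],
})

# drop_first avoids the redundant (perfectly collinear) reference column
dummies = pd.get_dummies(df, drop_first=True)
print(list(dummies.columns))
```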

4.4 - Model Evaluation Criterion

The model can make the following wrong predictions:

  1. Predicting that a customer attrited when they did not (False Positive)
  2. Predicting that a customer did not attrite when they did (False Negative)

Which case is more important?

How do we reduce this loss, i.e., how do we reduce False Negatives?
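Since a missed churner (False Negative) costs the bank a customer, recall on the attrited class is the natural metric to maximize. A toy illustration:

```python
from sklearn.metrics import recall_score

# Toy labels: 1 = attrited, 0 = existing
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0]

# Recall = TP / (TP + FN); maximizing it minimizes missed churners.
# Here 2 of the 3 true attriters are caught.
print(recall_score(y_true, y_pred))
```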

5 - Model Building and Evaluation

5.1 - Build 6 Models

5.2 - RandomizedSearchCV
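A sketch of the tuning setup, shown for the Decision Tree on synthetic data (the parameter grid below is an assumption, not the notebook's actual grid). RandomizedSearchCV samples a fixed number of hyperparameter combinations instead of exhausting the full grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the real train split
X, y = make_classification(n_samples=200, weights=[0.8], random_state=1)

# Sample hyperparameter combinations at random instead of a full grid
param_dist = {
    "max_depth": [3, 5, 7, None],
    "min_samples_leaf": [1, 5, 10],
}
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_distributions=param_dist,
    n_iter=5,
    scoring="recall",  # optimize for catching churners
    cv=3,
    random_state=1,
)
search.fit(X, y)
print(search.best_params_)
```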

5.2.1 - Decision Tree

5.2.2 - Adaboost

5.2.3 - GBM

5.3 - Oversampling train data using SMOTE

5.3.1 - Decision Tree on oversampled data

Performance Evaluation Using KFold and cross_val_score
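The cross-validated evaluation used throughout this section can be sketched as follows (synthetic data; recall scoring to match the evaluation criterion above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=1)

# 5-fold cross-validation of recall on the (resampled) train data
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(
    DecisionTreeClassifier(random_state=1), X, y,
    scoring="recall", cv=kfold,
)
print(scores.mean(), scores.std())  # mean recall and its spread across folds
```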

5.3.2 - AdaBoost on oversampled data

Performance Evaluation Using KFold and cross_val_score

5.3.3 - GBM on oversampled data

Performance Evaluation Using KFold and cross_val_score

5.4 - Undersampling train data using Random Under Sampler

5.4.1 - Decision Tree on undersampled data

Performance Evaluation Using KFold and cross_val_score

5.4.2 - AdaBoost on undersampled data

Performance Evaluation Using KFold and cross_val_score

5.4.3 - GBM on undersampled data

Performance Evaluation Using KFold and cross_val_score

6 - Comparing All Models

Performance on the Test Set

7 - Pipelines for Productionizing the Model

Creating two pipelines: one for the numerical columns (missing value imputation) and one for the categorical columns (missing value imputation and one-hot encoding).
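The two pipelines described above can be sketched with scikit-learn's `Pipeline` and `ColumnTransformer` (toy columns; the final estimator shown here is a Decision Tree, standing in for whichever model won the comparison):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy data with missing values in both column types (made-up values)
df = pd.DataFrame({
    "Customer_Age": [45, np.nan, 51, 38],
    "Gender": ["M", "F", np.nan, "M"],
})
y = [0, 1, 0, 1]

# Numerical pipeline: missing value imputation
num_pipe = Pipeline([("impute", SimpleImputer(strategy="median"))])
# Categorical pipeline: missing value imputation, then one-hot encoding
cat_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

pre = ColumnTransformer([
    ("num", num_pipe, ["Customer_Age"]),
    ("cat", cat_pipe, ["Gender"]),
])

model = Pipeline([("pre", pre), ("clf", DecisionTreeClassifier(random_state=1))])
model.fit(df, y)
print(model.predict(df))
```

Bundling preprocessing and the estimator in one pipeline means the identical transformations are applied at training and prediction time, which is what makes the model safe to productionize.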

8 - Business Recommendations